Amendments to the Claims

This listing of claims will replace all prior versions of claims in the application: 
Listing of Claims: 

1. (Currently Amended) A system for summarizing audio information, comprising: 
an analyzer to convert audio into frames; 

a fingerprinting component to convert the frames into fingerprints, each fingerprint based 
in part on a plurality of frames; 

a similarity detector to compute similarities between fingerprints, the similarity detector 
comprising a clustering function, the clustering function producing one or more sets of clusters 
of fingerprints based upon all fingerprints within a set of clusters meeting an initial threshold 
indicative of similarity;

a heuristic module to generate a thumbnail of the audio file from a set of clusters that has 
at least two gaps between fingerprints, wherein a gap is a temporal space between two adjacent 
fingerprints that exceeds a predetermined threshold when fingerprints within a set of clusters are 
placed in sequential temporal order[[, based in part on the similarity between fingerprints]].

2. (Original) The system of claim 1, the heuristic module comprising at least one of an 
energy component and a flatness component in order to help determine a suitable segment of 
audio for the thumbnail. 

3. (Original) The system of claim 2, the heuristic module is employed to automatically 
select voiced choruses over instrumental portions. 

4. (Original) The system of claim 2, the energy component and the flatness component are 
employed when the fingerprints do not result in finding a suitable chorus. 

5. (Original) The system of claim 1, further comprising a component to remove silence at 
the beginning and end of an audio clip via an energy-based threshold. 
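
The energy-based silence removal of claim 5 can be illustrated with a minimal sketch. The function name `trim_silence`, the per-frame energy list, and the threshold value are all hypothetical; the claim does not specify how the threshold is chosen.

```python
def trim_silence(frame_energies, threshold):
    """Return (start, end) frame indices, end exclusive, of the non-silent region.

    Assumption: a frame counts as silent when its energy is below `threshold`.
    Only leading and trailing silent frames are dropped; interior ones are kept.
    """
    start = 0
    end = len(frame_energies)
    # Skip silent frames at the beginning of the clip.
    while start < end and frame_energies[start] < threshold:
        start += 1
    # Skip silent frames at the end of the clip.
    while end > start and frame_energies[end - 1] < threshold:
        end -= 1
    return start, end
```

A clip that is silent throughout yields an empty range (start equal to end).
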

6. (Original) The system of claim 1, the fingerprint component further comprising a
normalization component, such that an average Euclidean distance from each fingerprint to
other fingerprints for an audio clip is one.
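
One possible reading of the normalization in claim 6 is a single global scale factor that makes the mean pairwise Euclidean distance across the clip equal to one. The sketch below follows that reading; the function name `normalize_fingerprints` and the list-of-lists representation are assumptions, and the claim could also be read as a per-fingerprint normalization.

```python
import math
from itertools import combinations


def normalize_fingerprints(fingerprints):
    """Scale all fingerprints so their mean pairwise Euclidean distance is 1.

    Assumption: one global scale factor per audio clip.
    """
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        # Fewer than two fingerprints: nothing to normalize against.
        return [list(f) for f in fingerprints]
    mean_dist = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    if mean_dist == 0:
        # All fingerprints identical: scaling is undefined, return copies.
        return [list(f) for f in fingerprints]
    return [[x / mean_dist for x in f] for f in fingerprints]
```
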

7. (Original) The system of claim 1, the analyzer computes a set of spectral magnitudes for 
an audio frame. 

8. (Original) The system of claim 7, for each frame, a mean, normalized energy E is 
computed by dividing a mean energy per frequency component within the frame by the average 
of that quantity over frames in an audio file. 
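
The normalized energy of claim 8 can be sketched as follows. This is an illustration only: it assumes energy per frequency component is the squared spectral magnitude, and the name `normalized_frame_energies` is hypothetical.

```python
def normalized_frame_energies(spectra):
    """Return the mean, normalized energy E for each frame.

    `spectra` is a list of frames, each a list of spectral magnitudes.
    Assumption: energy of a frequency component is its squared magnitude.
    """
    # Mean energy per frequency component within each frame.
    per_frame = [sum(m * m for m in frame) / len(frame) for frame in spectra]
    # Average of that quantity over all frames in the audio file.
    file_mean = sum(per_frame) / len(per_frame)
    return [e / file_mean for e in per_frame]
```

By construction the returned values average to one over the file, so a frame with E above one is louder than the file average.
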

9. (Original) The system of claim 8, further comprising a component that selects a middle 
portion of an audio file to mitigate quiet introduction and fades appearing in the audio file. 

10. (Original) The system of claim 2, the flatness component employs a number that is added 
to spectral magnitudes for each frequency component, to mitigate numerical problems when 
determining logs. 

11. (Original) The system of claim 10, the flatness component includes a frame-quantity 
computed as a log normalized geometric mean of the spectral magnitudes. 

12. (Original) The system of claim 11, the normalization is performed by subtracting a
per-frame log arithmetic mean of the per-frame magnitudes from the geometric mean.
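
Claims 10 to 12 together describe a log spectral flatness measure. A minimal sketch, assuming natural logarithms and a hypothetical offset `eps` (the claims give no value for the added number):

```python
import math


def log_spectral_flatness(magnitudes, eps=1e-10):
    """Log geometric mean minus log arithmetic mean of one frame's magnitudes.

    `eps` is a hypothetical small constant added to every magnitude so that
    log(0) is never evaluated. The result is at most 0, and it approaches 0
    for a flat (noise-like) spectrum.
    """
    shifted = [m + eps for m in magnitudes]
    # The log of the geometric mean is the mean of the logs.
    log_geometric = sum(math.log(m) for m in shifted) / len(shifted)
    log_arithmetic = math.log(sum(shifted) / len(shifted))
    return log_geometric - log_arithmetic
```

A spectrum dominated by a single peak gives a strongly negative value, which is why a higher value indicates a flatter frame.
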

13. (Currently Amended) The system of claim 1, the heuristic component selects the set of
clusters from which to generate the audio thumbnail based upon at least one of a mean spectral
quality value determined for the set of clusters or a cluster spread quality value determined for
the set of clusters [[the similarity detector comprising a clustering function, the clustering function
producing clusters of similar fingerprints]].

14. (Currently Amended) The system of claim 13, the heuristic component selects the set of
clusters that has the highest value for the sum of the square of the mean spectral quality value
determined for the set of clusters and the cluster spread quality value determined for the set of
clusters.



15. (Currently Amended) The system of claim 1[[4]], the initial threshold is a normalized
Euclidean distance between two fingerprints [[further comprising a fingerprint F1 or an identifying
index related to F1 that is added to a cluster containing fingerprint F2 in the cluster set if F1 and
F2 satisfy at least two conditions: with respect to a first condition, a normalized Euclidean
distance from F1 to F2 is below a first threshold, and with respect to a second condition, a
temporal gap in an audio between where F1 is computed and where F2 is computed is above a
second threshold]].



16. (Currently Amended) The system of claim 1, wherein a cluster is a group of fingerprints
in a set of clusters that lies between the same two gaps or lies between the beginning of the
sequence of fingerprints and the first gap in the sequence or lies between the last gap in the
sequence and the end of the sequence of fingerprints [[A computer readable medium having
computer readable instructions stored thereon for implementing the system of claim 1]].
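
The relationship between gaps and clusters recited in claims 1 and 16 can be sketched as follows. The function names and the use of plain timestamps in seconds are assumptions made for illustration; the claims do not fix a unit or a threshold value.

```python
def split_into_clusters(times, gap_threshold):
    """Group fingerprint time positions into clusters separated by gaps.

    A gap is a temporal space between two adjacent fingerprints (in sorted
    order) that exceeds `gap_threshold`. Each cluster is the run of
    fingerprints before the first gap, between two gaps, or after the last.
    """
    clusters = []
    current = []
    for t in sorted(times):
        if current and t - current[-1] > gap_threshold:
            clusters.append(current)
            current = []
        current.append(t)
    if current:
        clusters.append(current)
    return clusters


def has_at_least_two_gaps(times, gap_threshold):
    """Two or more gaps are equivalent to three or more clusters."""
    return len(split_into_clusters(times, gap_threshold)) >= 3
```

For example, fingerprints at 10, 12, 50, 52 and 95 seconds with a 20 second threshold form three clusters, so that set would qualify as a thumbnail source under the two-gap condition.
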



17. (Currently Amended) An automatic thumbnail generator, comprising: 
means for converting an audio file into frames; 

means for fingerprinting the audio file, producing fingerprints based in part on a plurality 
of frames; and 

means for producing one or more sets of clusters of fingerprints based upon all 
fingerprints within a set of clusters meeting a predefined similarity threshold; and 

means for creating [[determining]] an audio thumbnail by selecting a set of clusters that has
at least two gaps between fingerprints, wherein a gap is a temporal space between two adjacent
fingerprints that exceeds a predetermined threshold when fingerprints within a set of clusters are
placed in sequential temporal order [[based in part on the fingerprints]].

18. (Currently Amended) A method to generate audio thumbnails, comprising:
generating a plurality of audio fingerprints, each audio fingerprint based in part on a
plurality of audio frames;

producing one or more sets of clusters of fingerprints based upon all fingerprints within a
set of clusters meeting a similarity threshold; and [[clustering the plurality of fingerprints into
fingerprint clusters; and]]

creating a thumbnail based upon a set of clusters that has at least two gaps between
fingerprints, wherein a gap is a temporal space between two adjacent fingerprints that exceeds a
predetermined threshold when fingerprints within a set of clusters are placed in sequential
temporal order [[based in part on the fingerprint clusters]].

19. (Currently Amended) The method of claim 18, clustering fingerprints within a set of
clusters into fingerprint clusters based upon the gaps [[the clustering further producing one or
more cluster sets, each cluster set comprising fingerprint clusters]].

20. (Currently Amended) The method of claim 18[[9]], the similarity threshold is a
normalized Euclidean distance between two fingerprints [[the clustering further comprising
determining whether a cluster set has three or more fingerprint clusters]].

21. (Currently Amended) The method of claim 18, the similarity threshold [[clustering based
in part on a threshold, the threshold]] chosen adaptively based upon the [[for an]] audio file and used
to help determine if two fingerprints belong to the same cluster set.

22. (Currently Amended) The method of claim 19[[8]], the clustering operating by 
considering one fingerprint at a time. 

23. (Currently Amended) The method of claim 19[[8]], further comprising determining a
parameter (D) describing how evenly spread clusters are, temporally, throughout an audio file. 

24. (Currently Amended) The method of claim 23, selecting the set of clusters from which to
generate the audio thumbnail based upon at least parameter (D) [[wherein a measure of temporal
spread is applied to the clusters in a given cluster set]].

25. (Original) The method of claim 24, (D) is measured as follows: 
normalizing a song to have duration of 1; 

setting a time position of an i'th cluster be $t_i$;
defining $t_0 \equiv 0$ and $t_{N+1} \equiv 1$; and

computed as $\left(1 - \sum_{i=1}^{N+1} (t_i - t_{i-1})^2\right)$, where N is a number of clusters in a cluster set.

26. (Original) The method of claim 25, further comprising determining an offset and scaling 
factor so that (D) takes a maximum value of 1 and minimum value of 0, for any N.
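
A sketch of the spread parameter (D) of claims 25 and 26 is given below. It rests on two assumptions that the claims do not state explicitly: that the unscaled quantity is one minus the sum of squared intervals between adjacent normalized cluster positions, and that the scaling factor is (N+1)/N, which is the factor that maps that quantity onto the range 0 to 1. The function name `cluster_spread` is hypothetical.

```python
def cluster_spread(cluster_times, duration):
    """Return D in [0, 1]: 1 for evenly spread clusters, 0 for fully bunched.

    Assumptions (not stated in the claims): the raw quantity is
    1 - sum((t_i - t_{i-1})**2) over i = 1..N+1, and the scaling factor
    is (N + 1) / N.
    """
    n = len(cluster_times)
    if n == 0:
        raise ValueError("at least one cluster is required")
    # Normalize the song to duration 1 and add t_0 = 0 and t_{N+1} = 1.
    points = [0.0] + sorted(t / duration for t in cluster_times) + [1.0]
    sum_sq = sum((b - a) ** 2 for a, b in zip(points, points[1:]))
    return (n + 1) / n * (1.0 - sum_sq)
```

With N evenly spaced clusters every interval is 1/(N+1), the sum of squares is 1/(N+1), and the scaled result is 1. With all clusters at the very start or end of the song, one interval is 1 and the rest are 0, so the result is 0.
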

27. (Original) The method of claim 25, further comprising determining a mean spectral 
quality for fingerprints in a set. 

28. (Original) The method of claim 27, wherein a mean spectral flatness for a set, and a 
parameter D, are combined to determine a best cluster set from among a plurality of cluster sets. 

29. (Original) The method of claim 28, the mean spectral flatness and parameter D are 
combined into a single parameter associated with each cluster set, such that the set with the 
extremal value of the parameter is selected to be the best set.

30. (Original) The method of claim 29, when the best cluster set is selected, a best fingerprint 
within the cluster set is determined as the fingerprint in which surrounding audio, of duration 
about equal to a duration of an audio thumbnail, has maximum spectral energy or flatness. 

31. (Original) The method of claim 18, the creating further comprising determining a cluster 
by determining a longest section of audio within an audio file that repeats in the audio file. 

32. (Original) The method of claim 18, the creating further comprising at least one of:
rejecting clusters that are close to a beginning or end of a song; 

rejecting clusters for which energy falls below a threshold for any fingerprint in a 
predetermined window; and 

selecting a fingerprint having a highest average spectral flatness measure in the 
predetermined window. 

33. (Original) The method of claim 18, the creating further comprising generating a 
thumbnail by specifying time offsets in an audio file. 

34. (Original) The method of claim 18, the creating further comprising automatically fading a 
beginning or an end of an audio thumbnail. 
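
The automatic fading of claim 34 could take many forms. The sketch below assumes a linear gain ramp applied to both ends of the thumbnail; the name `apply_fades` and the choice of a linear ramp are illustrative assumptions, not taken from the claims.

```python
def apply_fades(samples, fade_len):
    """Return a copy of `samples` with a linear fade-in and fade-out.

    Assumption: a linear gain ramp over `fade_len` samples at each end.
    The ramp is shortened if the clip is too short to hold both fades.
    """
    n = len(samples)
    fade_len = min(fade_len, n // 2)
    out = list(samples)
    for i in range(fade_len):
        gain = i / fade_len  # rises from 0 toward 1 across the ramp
        out[i] *= gain           # fade in at the beginning
        out[n - 1 - i] *= gain   # fade out at the end
    return out
```
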

35. (Original) The method of claim 18, the generating further comprising processing an audio 
file in at least two layers, where the output of a first layer is based on a log spectrum computed 
over a small window and a second layer operates on a vector computed by aggregating vectors 
produced by the first layer. 

36. (Original) The method of claim 35, further comprising providing a wider temporal 
window in a subsequent layer than a preceding layer.

37. (Original) The method of claim 36, further comprising employing at least one of the 
layers to compensate for time misalignment. 